
    Large Scale Application of Neural Network Based Semantic Role Labeling for Automated Relation Extraction from Biomedical Texts

    To reduce the increasing amount of time spent on literature search in the life sciences, several methods for automated knowledge extraction have been developed. Co-occurrence-based approaches can process large text corpora like MEDLINE in an acceptable time but cannot extract any specific type of semantic relation. Semantic relation extraction methods based on syntax trees, on the other hand, are computationally expensive, and the interpretation of the generated trees is difficult. Several natural language processing (NLP) approaches for the biomedical domain exist that focus specifically on the detection of a limited set of relation types. For systems biology, generic approaches are needed that detect a multitude of relation types and, in addition, can process large text corpora; the number of systems meeting both requirements is very limited. We introduce the use of SENNA (“Semantic Extraction using a Neural Network Architecture”), a fast and accurate neural-network-based Semantic Role Labeling (SRL) program, for the large-scale extraction of semantic relations from the biomedical literature. A comparison of processing times of SENNA and other SRL systems or syntactic parsers used in the biomedical domain revealed that SENNA is the fastest Proposition Bank (PropBank) conforming SRL program currently available. A total of 89 million biomedical sentences were tagged with SENNA on a 100-node cluster within three days. The accuracy of the presented relation extraction approach was evaluated on two test sets of annotated sentences, resulting in precision/recall values of 0.71/0.43. We show that both the accuracy and the processing speed of the proposed semantic relation extraction approach are sufficient for its large-scale application to biomedical text. The proposed approach is highly generalizable regarding the supported relation types and appears especially suited for general-purpose, broad-scale text mining systems. The presented approach bridges the gap between fast, co-occurrence-based approaches that lack specific semantic relations and highly specialized, computationally demanding NLP approaches.
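The core idea of the abstract can be sketched in a few lines: an SRL tagger such as SENNA assigns PropBank-style predicate-argument structures (PAS), and a binary relation is read off the core arguments. The frame layout and the `pas_to_relation` helper below are illustrative assumptions for this sketch, not SENNA's actual output format.

```python
# Minimal sketch, assuming a PAS frame is a dict mapping PropBank labels
# (verb, A0 = agent, A1 = patient) to argument strings.

def pas_to_relation(frame):
    """Map a PAS frame to a (subject, predicate, object) triple."""
    if "A0" in frame and "A1" in frame:
        return (frame["A0"], frame["verb"], frame["A1"])
    return None  # frames lacking both core arguments yield no relation

frame = {"verb": "inhibit", "A0": "protein kinase A", "A1": "channel gating"}
print(pas_to_relation(frame))
# → ('protein kinase A', 'inhibit', 'channel gating')
```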

    Evaluation of SENNA* on the LLL'05 and BC-PPI corpora.

    Precision and recall of the relation extraction (RE) step applied to the LLL'05 and BC-PPI data sets. TPs, FPs, and FNs from both data sets were summed to obtain overall values for precision, recall, and F-measure.
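The caption describes micro-averaging: TP/FP/FN counts from both corpora are summed before computing precision, recall, and F-measure. A minimal sketch of that computation follows; the per-corpus counts are placeholders, not the paper's actual numbers.

```python
# Micro-averaged precision, recall, and F-measure over several corpora.

def micro_scores(counts):
    tp = sum(c["tp"] for c in counts)
    fp = sum(c["fp"] for c in counts)
    fn = sum(c["fn"] for c in counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

corpora = [{"tp": 50, "fp": 20, "fn": 60},   # e.g. LLL'05 (placeholder counts)
           {"tp": 30, "fp": 13, "fn": 47}]   # e.g. BC-PPI (placeholder counts)
p, r, f = micro_scores(corpora)
```

Note that the micro F-measure reduces to 2·TP/(2·TP+FP+FN), so summing the raw counts first and computing the scores once is exact, not an approximation.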

    Performance comparison of SENNA with common SRL programs and syntactic parsers.

    Processing time (ptime) of SENNA* (the SENNA variant we used for PAS generation), ASSERT, and the Stanford PCFG and lexicalized parsers was measured relative to the SENNA 1.0 web version on four test sets of 500 sentences each. Sentence lengths ranged from 65–75 characters in the first test set to 235–245 characters in the fourth. The Enju-mogura parser appeared to have particular difficulty with the 175-character test set we used; on other sentences of similar length, its processing times were comparable to its results on the 65-, 135-, and 235-character test sets.
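The benchmark design in this caption can be sketched as follows: test sentences are drawn from narrow character-length bins (65–75, …, 235–245) and a parser's wall-clock time over each bin is measured. The stand-in parse function and synthetic sentences below are assumptions for illustration, not any of the compared systems.

```python
import time

def time_parser(sentences, parse):
    """Total wall-clock time to run parse() over all sentences."""
    start = time.perf_counter()
    for s in sentences:
        parse(s)
    return time.perf_counter() - start

sentences = ["x" * 40, "y" * 70, "z" * 240]               # synthetic corpus
bin_65_75 = [s for s in sentences if 65 <= len(s) <= 75]  # one length bin
elapsed = time_parser(bin_65_75, lambda s: s.split())     # trivial stand-in parser
```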

    Determination of the average sentence length in the test sets as well as in the three sources of biomedical literature used for PAS extraction.

    Evaluation of average sentence length in characters for different literature resources in the biomedical domain. *) Including titles. **) Excluding titles.
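The quantity tabulated here is simply the mean sentence length in characters over a corpus. A minimal sketch, with placeholder sentences standing in for the biomedical corpora:

```python
def avg_sentence_length(sentences):
    """Mean length in characters over a list of sentences."""
    return sum(len(s) for s in sentences) / len(sentences)

print(avg_sentence_length(["ABC.", "ABCDEF."]))
# → 5.5
```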

    Histogram of the fraction of wrongly predicted verbs among the 5000 most frequent verb candidates.

    After manually checking these 5000 verb candidates for false verb assignments, the candidates were grouped into 100 subsets of 50 candidates each. For each subset, the fraction of candidates wrongly labeled as “verb” by SENNA was evaluated (y-axis). The histogram shows these 100 subsets ordered by descending candidate frequency from left to right. With decreasing term frequency, the number of wrong assignments rises.
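The grouping described in this caption can be sketched directly: 5000 frequency-ranked verb candidates are split into 100 consecutive subsets of 50, and the per-subset fraction of wrongly labeled candidates is computed. The boolean labels below are synthetic, not the paper's manual annotations.

```python
def error_fractions(is_wrong, subset_size=50):
    """is_wrong: one boolean per candidate, ordered by descending frequency.
    Returns the fraction of wrong assignments in each consecutive subset."""
    return [sum(is_wrong[i:i + subset_size]) / subset_size
            for i in range(0, len(is_wrong), subset_size)]

# synthetic pattern: errors concentrate among the rarer candidates
labels = [False] * 4000 + [True] * 1000
fractions = error_fractions(labels)  # 100 values, one per subset of 50
```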